data generation
Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation
Zarkadis, Iakovos-Christos, Douligeris, Christos
Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks, in the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features, from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019, with the same feature space. In the first task we use machine learning (ML) algorithms, with stratified cross validation, in order to prevent network attacks, with stability and reliability. In the second task we use adversarial learning algorithms to generate synthetic data, compare them with the real ones and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection
Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT significantly improves downstream classification performance compared to existing oversampling and generative methods, while maintaining comparable privacy protection and preserving feature correlations present in the original data.
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Indonesia > Bali (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- North America > Mexico (0.04)
- Africa (0.04)
- (6 more...)
- Research Report > New Finding (0.92)
- Personal (0.67)
- Media > Film (1.00)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Law (1.00)
- (14 more...)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Asia > South Korea > Daejeon > Daejeon (0.04)
- North America > United States > California (0.04)
- North America > Nicaragua (0.04)
- North America > Montserrat (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Information Technology (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Texas > Tarrant County > Fort Worth (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > France (0.14)
- North America > United States > California > Alameda County > Berkeley (0.04)
7 Checklist
For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Y es] We release the code and the models If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Y es] We included the instructions given to participants in appendix F. In this appendix, we describe the neural network architecture used for our agents.Figure 2: Transformer encoder (left) used in both policy proposal network (center) and value network (right). Our model architecture is shown in Figure 2. It is essentially identical to the architecture in [11], except that it replaces the specialized graph-convolution-based encoder with a much simpler transformer encoder, removes all dropout layers, and uses separate policy and value networks. Aside from the encoder, the other aspects of the architecture are the same, notably the LSTM policy decoder, which decodes orders through sequential attention over each successive location in the encoder output to produce an action. The input to our new encoder is also identical to that of [11], consisting of the same representation of the current board state, previous board state, and a recent order embedding. Rather than processing various parts of this input in two parallel trunks before combining them into a shared encoder trunk, we take the simpler approach of concatenating all features together at the start, resulting in 146 feature channels across each of 81 board locations (75 region + 6 coasts). We pass this through a linear layer, add pointwise a learnable per-position per-channel bias, and then pass this to a standard transformer encoder architecture.